library(tidyverse)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(forecast)
# One-time install of the data package:
remotes::install_github("nrennie/LondonMarathon")
data(winners, package = "LondonMarathon")
data(london_marathon, package = "LondonMarathon")
The London Marathon, or the TCS London Marathon, is a truly special event that brings people from all walks of life together. It was the brainchild of visionaries Chris Brasher and John Disley back in 1981 and has since become the UK’s second-largest annual road race, behind only the famous Great North Run in Newcastle.
Usually happening in April, the marathon had a temporary switch to October in 2020, 2021, and 2022 due to the challenges posed by the COVID-19 pandemic. The course itself is known for its mostly flat terrain, winding its way around the beautiful River Thames, starting at Blackheath and ending with a grand finish at The Mall.
But the London Marathon isn’t just a race; it’s a melting pot of experiences. There’s the mass race, where people of all fitness levels challenge themselves and relish the thrill of completing a marathon. For elite long-distance runners, both men and women, there are professional races showcasing top-tier athleticism. Wheelchair races provide a platform for elite-level competition among both men and women.
Now, let’s dive into the details of the marathon with the comprehensive data:
London Marathon Data:
Winner Data:
What truly sets the London Marathon apart is its heartwarming philanthropic side. Participants aren’t just running for personal glory; they’re making a difference. Since its start, the marathon has raised over £1 billion for various charities. In 2019 alone, an incredible £66.4 million was raised, marking it as the highest single-day fundraising event in the marathon’s history.
So, as the London Marathon continues to capture hearts and inspire runners worldwide, it’s not just a race; it’s a legacy of community and achievement.
Analyzing Trends in Winning Times Over the Years: Our examination of winning times involves creating time series plots for each category and using regression analysis to identify significant trends. Through careful comparison and visualization, we aim to uncover patterns and changes in winning times across categories.
Exploring Wheelchair Winners’ Performances: Shifting our focus to inclusivity, we analyze the performances of wheelchair winners to understand how their achievements have evolved. A detailed examination of winning times helps us visualize trends, celebrating the success of wheelchair participants in the marathon.
Comparing Winning Times Between Categories: Using t-tests or analysis of variance (ANOVA), we compare winning times between different categories such as Men vs. Women and Wheelchair Men vs. Wheelchair Women. This analysis enables the exploration of variations in performance and identification of any significant differences between these categories.
Correlation Analysis: To delve deeper into the relationships within the data, we conduct correlation analysis.
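As a preview of the regression approach mentioned above, here is a minimal sketch of fitting a linear trend to winning times. The data here are simulated stand-ins for one category of the winners dataset; the variable names are illustrative.

```r
# Sketch: linear trend of winning time (hours) against year for one
# category, on simulated stand-in data.
set.seed(1)
toy <- data.frame(
  Year = 1981:2000,
  Time = 2.2 - 0.005 * (1981:2000 - 1981) + rnorm(20, sd = 0.01)
)
fit <- lm(Time ~ Year, data = toy)
# A negative Year coefficient means winning times are falling over the years.
coef(fit)["Year"]
```

In the real analysis, the same `lm(Time ~ Year)` call can be applied within each Category once the winners data are prepared.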
In our exploration of the rich history of the London Marathon, we delve into a detailed examination of different aspects. To begin, let’s uncover the evolving trends in marathon participation over the years by closely examining changes in the number of applicants and accepted participants. To give these numbers context and meaning, we calculate the percentage of participants who moved from being accepted to actually starting the race and, subsequently, the percentage of those starters who successfully finished. This analysis aims to reveal any noticeable trends or patterns, breathing life into the raw numbers.
In the realm of inclusivity, we shift our focus to wheelchair winners, seeking to understand how their performances have evolved over the years. Through a detailed analysis of winning times, we aim to visualize trends and patterns, celebrating the achievements of wheelchair participants in the marathon.
Turning our attention to the marathon heroes, we delve into the winner analysis. The spotlight is on the most successful marathon winners, and we aim to understand how their performances have evolved over time by examining variables such as nationality and winning times and identifying any discernible patterns. Visualizing winning times over the years adds a vivid layer to this exploration.
Geographical analysis becomes our compass as we investigate trends or patterns in the nationalities of winners. Employing visualizations through maps, we seek to represent the distribution of winners’ nationalities, painting a global picture of the marathon’s diverse champions.
In the financial realm, we track the yearly fundraising growth, exploring how the total fundraising amount has changed annually. Utilizing a line chart, we aim to capture the financial heartbeat of the London Marathon, identifying any significant spikes or drops and understanding the financial narrative that accompanies this iconic event.
Next, we delve into long-term charity partnerships. By identifying charities that have consistently participated, we analyze their fundraising trends over time. This exploration unveils the enduring commitment of these charitable partners to the London Marathon and the impact of their contributions.
Lastly, we conduct a comparative analysis to understand how the number of male and female participants has changed over the years. Through visualizations, we aim to illustrate the evolving dynamics of gender participation in the London Marathon, showcasing the inclusive spirit of this remarkable event.
The data provided gives us insights into the London Marathon, offering details on the number of applicants, accepted participants, starters, finishers, funds raised, and the official charities involved for each year. Additionally, we have separate information on the winners across different categories such as Men, Women, Wheelchair Men, and Wheelchair Women throughout the years.
Let’s break down the key aspects:
Cases: Each row in the datasets represents a specific year of the London Marathon. There are 42 entries in the “London Marathon Data” and 165 entries in the “Winner.csv” data.
Variables in London Marathon Data:
Variables in Winner Data:
How was it collected?
As the two datasets cover different aspects of the London Marathon, we choose not to combine them but to analyze them separately.
For the winners dataset, all variables and observations are relevant. For the london_marathon dataset, only the variables Year, Applicants, Accepted, Starters, and Finishers are relevant to our research questions.
For the winners dataset, no observations are filtered out. For the london_marathon dataset, we filtered out the observations from 2020 to 2022.
We created three new variables in the london_marathon dataset: the ratio of accepted, the ratio of starters, and the ratio of finishers.
We chose not to impute the remaining missing values in the london_marathon dataset, because they appear to be missing systematically. No imputation method is appropriate until we understand why they are missing, which is beyond the scope of this report.
Here is the data types summary.
library(knitr)
table <- data.frame('Column'=c('Category','Year','Athlete','Nationality','Time'), 'Class'=c('character','integer','character','character','character'), 'Description'=c('Category of race','Year','Name of the winner','Nationality of the winner','Winning time'),'Example'=c('Men',1981,'Dick Beardsley (Tie)','United States', '02:11:48'))
kable(table, caption = "Winners")
| Column | Class | Description | Example |
|---|---|---|---|
| Category | character | Category of race | Men |
| Year | integer | Year | 1981 |
| Athlete | character | Name of the winner | Dick Beardsley (Tie) |
| Nationality | character | Nationality of the winner | United States |
| Time | character | Winning time | 02:11:48 |
table2 <- data.frame('Column'=c('Date','Year','Applicants','Accepted','Starters','Finishers','Raised','Official.charity'),'Class'=c('character','integer','integer','integer','integer','integer','double','character'),'Description'=c('Date of the race','Year','Number of people who applied','Number of people accepted','Number of people who started','Number of people who finished','Amount raised for charity (£ millions)','Official charity'),'Example'=c('1981-03-29','1981','20000','7747','7055','6255','46.5','SportsAid'))
kable(table2, caption = "London Marathon")
| Column | Class | Description | Example |
|---|---|---|---|
| Date | character | Date of the race | 1981-03-29 |
| Year | integer | Year | 1981 |
| Applicants | integer | Number of people who applied | 20000 |
| Accepted | integer | Number of people accepted | 7747 |
| Starters | integer | Number of people who started | 7055 |
| Finishers | integer | Number of people who finished | 6255 |
| Raised | double | Amount raised for charity (£ millions) | 46.5 |
| Official.charity | character | Official charity | SportsAid |
In the london_marathon dataset, the numbers of applicants for the early years (roughly 1981–2010) appear to be rounded to the nearest thousand. In addition, the numbers of applicants, accepted, starters, and finishers in 2020 are abnormally low, likely because the 2020 race was restricted due to the COVID-19 pandemic.
Overview of the dataset
head(winners)
## # A tibble: 6 Ă— 5
## Category Year Athlete Nationality Time
## <chr> <dbl> <chr> <chr> <times>
## 1 Men 1981 Dick Beardsley (Tie) United States 0.09152778
## 2 Men 1981 Inge Simonsen (Tie) Norway 0.09152778
## 3 Men 1982 Hugh Jones United Kingdom 0.08986111
## 4 Men 1983 Mike Gratton United Kingdom 0.09008102
## 5 Men 1984 Charlie Spedding United Kingdom 0.09024306
## 6 Men 1985 Steve Jones United Kingdom 0.08907407
# `Time` is stored as a fraction of a day; multiply by 24 to express it in hours
winners_data <- winners |> mutate(Time=24*Time)
winners_data$Year <- as.numeric(winners_data$Year)
winners_data$Time <- as.numeric(winners_data$Time)
# Create time series plots for all categories with captions
ggplot(winners_data, aes(x = Year, y = Time, color = Category)) +
geom_line() +
ggtitle("Time Trends - All Categories") +
xlab("Year") +
ylab("Winning Time (in hours)") +
labs(caption = "Figure 1: Time trends in winning times for different categories over the years.")
Men and Women:
Both the Men and Women categories show a relatively linear trend over the years. For Men, the winning time decreases from roughly 2.2 hours to just over 2 hours, indicating a consistent improvement in performance. Similarly, for Women, the winning time decreases from approximately 2.5 to 2.3 hours, suggesting a trend of faster winning times.
Wheelchair Men and Wheelchair Women:
The plot reveals a substantial drop in winning times for both Wheelchair Men and Wheelchair Women over the years. Wheelchair Men’s winning time falls from about 3.3 hours to roughly 1.4 hours, indicating a significant performance improvement. Wheelchair Women’s winning time also falls, from about 4.5 hours to roughly 1.7 hours, reflecting a substantial and consistent improvement.
The overall trend suggests that there has been an improvement in athletic performance across all categories over the years, with notable drops in winning times.
boxplot(Time ~ Category, data = winners_data, main="Boxplots of Winning Time of Each Category", ylab="Winning Time (hours)")
From this plot, we can see a clear difference in winning times between the Men and Women categories.
# t-test for Men vs. Women
t_test_men_women <- t.test(Time ~ Category, data = subset(winners_data, Category %in% c("Men", "Women")))
print(t_test_men_women)
##
## Welch Two Sample t-test
##
## data: Time by Category
## t = -21.25, df = 69.816, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Men and group Women is not equal to 0
## 95 percent confidence interval:
## -0.2865984 -0.2374147
## sample estimates:
## mean in group Men mean in group Women
## 2.121004 2.383010
t-value: The t-value of -21.25 is highly significant. It indicates that the difference in mean finishing times between Men and Women is substantial, and it is unlikely to be due to random chance. The negative sign suggests that, on average, Men finish the marathon faster than Women.
Degrees of Freedom: The degrees of freedom are approximately 69.816, calculated using Welch’s correction for unequal variances.
Hypotheses:
Null Hypothesis (H0): The mean winning time for Men is equal to the mean winning time for Women. ÎĽ(Men) = ÎĽ(Women)
Alternative Hypothesis (H1): The mean winning time for Men is different from the mean winning time for Women. μ(Men) ≠μ(Women)
p-value: The p-value is extremely small (< 2.2e-16), indicating strong evidence against the null hypothesis. A low p-value indicates that the observed difference in mean winning times between Men and Women is highly unlikely to be due to random chance.
Confidence Interval: The 95% confidence interval (-0.287 to -0.237 hours) provides a range where the true difference in mean winning times between Men and Women is likely to lie. Since this range doesn’t include zero, it supports the idea that there is a significant difference.
Sample Estimates: The mean winning time for Men (2.12 hours) is significantly different from the mean winning time for Women (2.38 hours).
Conclusion: Given the very low p-value, we reject the null hypothesis. There is strong statistical evidence to suggest that there is a significant difference in mean winning times between Men and Women. The negative mean difference in the confidence interval indicates that, on average, Men have a lower winning time compared to Women.
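For completeness, the Welch degrees of freedom quoted above come from the Welch–Satterthwaite approximation. A small sketch on simulated data (the group sizes and spreads are illustrative, not the real ones) shows the formula agreeing with `t.test()`:

```r
# Sketch: Welch-Satterthwaite degrees of freedom, verified against t.test().
set.seed(42)
g1 <- rnorm(30, mean = 2.12, sd = 0.10)   # stand-in for Men times (hours)
g2 <- rnorm(35, mean = 2.38, sd = 0.15)   # stand-in for Women times (hours)
v1 <- var(g1) / length(g1)
v2 <- var(g2) / length(g2)
df_welch <- (v1 + v2)^2 /
  (v1^2 / (length(g1) - 1) + v2^2 / (length(g2) - 1))
tt <- t.test(g1, g2)                      # Welch test is R's default
c(manual = df_welch, from_t.test = unname(tt$parameter))
```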
# t-test for Wheelchair Men vs. Wheelchair Women
t_test_wheelchair <- t.test(Time ~ Category, data = subset(winners_data, Category %in% c("Wheelchair Men", "Wheelchair Women")))
print(t_test_wheelchair)
##
## Welch Two Sample t-test
##
## data: Time by Category
## t = -3.684, df = 66.574, p-value = 0.0004625
## alternative hypothesis: true difference in means between group Wheelchair Men and group Wheelchair Women is not equal to 0
## 95 percent confidence interval:
## -0.6161296 -0.1830727
## sample estimates:
## mean in group Wheelchair Men mean in group Wheelchair Women
## 1.761944 2.161546
t-value: The t-value is -3.684, indicating a substantial difference between the mean winning times of Wheelchair Men and Wheelchair Women. The negative sign suggests that, on average, Wheelchair Women have a higher winning time.
Degrees of Freedom (df): The degrees of freedom are 66.574, calculated using Welch’s method.
Null Hypothesis (H0): The mean winning times for Wheelchair Men and Wheelchair Women are equal. Alternative Hypothesis (HA): The mean winning times for Wheelchair Men and Wheelchair Women are not equal.
p-value = 0.0004625: The p-value is less than 0.05, indicating that there is strong evidence to reject the null hypothesis.
95 percent confidence interval: The range -0.616 to -0.183 hours is the range within which we can be 95% confident that the true difference in means lies. It suggests that the mean winning time for Wheelchair Women is likely between 0.18 and 0.62 hours higher than that of Wheelchair Men.
With a small p-value and a negative t-value, there is evidence to suggest that there is a statistically significant difference in winning times between Wheelchair Men and Wheelchair Women. The mean winning time for Wheelchair Women appears to be significantly higher than that for Wheelchair Men. The 95% confidence interval provides a range for the magnitude of this difference.
# ANOVA for Men vs. Women vs. Wheelchair Men vs. Wheelchair Women
anova_model <- aov(Time ~ Category, data = subset(winners_data, Category %in% c("Men", "Women", "Wheelchair Men", "Wheelchair Women")))
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Category 3 8.036 2.679 24.34 4.81e-13 ***
## Residuals 161 17.714 0.110
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null Hypothesis (H0): There is no significant difference in mean winning times among the categories.
Alternative Hypothesis (H1): There is a significant difference in mean winning times among at least one pair of categories.
We reject the null hypothesis, concluding that there are significant differences in mean winning times across at least one pair of categories (Men, Women, Wheelchair Men, Wheelchair Women).
F-Test: The F value of 24.34 is higher than expected by chance, signifying significant differences in mean winning times among categories.
p-value: With a tiny p-value (4.81e-13), there is substantial evidence against the null hypothesis, implying significant differences in mean winning times.
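A significant ANOVA only says that at least one pair of categories differs; a post-hoc procedure such as Tukey’s HSD (not run in this report) would identify which pairs. A sketch on simulated data with three illustrative groups:

```r
# Sketch: Tukey's HSD after a significant one-way ANOVA, on toy data.
set.seed(7)
toy <- data.frame(
  Category = rep(c("A", "B", "C"), each = 20),
  Time = c(rnorm(20, 2.1, 0.1), rnorm(20, 2.4, 0.1), rnorm(20, 2.1, 0.1))
)
fit <- aov(Time ~ Category, data = toy)
TukeyHSD(fit)   # pairwise mean differences with adjusted p-values
```

The `p adj` column adjusts the p-values for the multiple pairwise comparisons.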
# Explore potential correlations between winning times and other variables.
# Convert 'Nationality' to a factor variable
winners_data$Nationality <- as.factor(winners_data$Nationality)
# Calculate correlation coefficients
cor_year_time <- cor(winners_data$Year, winners_data$Time, method = "pearson")
cor_nat_time <- cor(as.numeric(winners_data$Nationality), winners_data$Time, method = "spearman")
print(paste("Correlation between Year and Time (Pearson):", cor_year_time))
## [1] "Correlation between Year and Time (Pearson): -0.464838452893853"
print(paste("Correlation between Nationality and Time (Spearman):", cor_nat_time))
## [1] "Correlation between Nationality and Time (Spearman): -0.13136760903489"
We observed a correlation between Year and Time (Pearson) of -0.4648, indicating a moderate negative linear relationship. As the Year increases, the Time tends to decrease moderately. The closer the correlation coefficient is to -1, the stronger the negative linear relationship.
Furthermore, the correlation between Nationality and Time (Spearman) is -0.1314, a weak negative monotonic relationship. Note, however, that Nationality is a nominal variable: its numeric codes follow arbitrary alphabetical factor order, so this coefficient should be interpreted with caution.
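The point estimates above carry no significance statement; `cor.test()` supplies one. A sketch on simulated data with a true negative trend, mimicking the Year-Time relationship:

```r
# Sketch: attaching a p-value to a Pearson correlation with cor.test().
set.seed(3)
year <- 1981:2022
time <- 2.3 - 0.004 * (year - 1981) + rnorm(length(year), sd = 0.05)
ct <- cor.test(year, time, method = "pearson")
# Estimate should be clearly negative, with a small p-value.
c(estimate = unname(ct$estimate), p.value = ct$p.value)
```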
Here, the null hypothesis is \(H_0\): there is no difference in athletes’ performance across nationalities. The alternative hypothesis is \(H_A\): there is a difference in athletes’ performance across nationalities.
winners_data$Time <- as.numeric(winners_data$Time)
# Performing ANOVA
anova_model <- aov(Time ~ Nationality, data = winners_data)
anova_summary <- summary(anova_model)
# Print ANOVA summary
print(anova_summary)
## Df Sum Sq Mean Sq F value Pr(>F)
## Nationality 23 9.792 0.4257 3.761 5.51e-07 ***
## Residuals 141 15.959 0.1132
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# In case the assumptions of ANOVA are not met, use Kruskal-Wallis test as a non-parametric alternative
kruskal_test <- kruskal.test(Time ~ Nationality, data = winners_data)
print(kruskal_test)
##
## Kruskal-Wallis rank sum test
##
## data: Time by Nationality
## Kruskal-Wallis chi-squared = 79.787, df = 23, p-value = 3.437e-08
Here we can see that both tests give a p-value close to 0, so there is evidence to reject the null hypothesis, and we conclude that there is a difference in athletes’ performance across nationalities.
winners_data |> group_by(Category,Nationality) |> summarise(count = n()) |> arrange(desc(count)) |> slice_head()
## # A tibble: 4 Ă— 3
## # Groups: Category [4]
## Category Nationality count
## <chr> <fct> <int>
## 1 Men Kenya 17
## 2 Wheelchair Men United Kingdom 16
## 3 Wheelchair Women United Kingdom 15
## 4 Women Kenya 14
Overview of the dataset
head(london_marathon)
## # A tibble: 6 Ă— 8
## Date Year Applicants Accepted Starters Finishers Raised
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1981-03-29 1981 20000 7747 7055 6255 NA
## 2 1982-05-09 1982 90000 18059 16350 15116 NA
## 3 1983-04-17 1983 60000 19735 16500 15793 NA
## 4 1984-05-13 1984 70000 21142 16992 15675 NA
## 5 1985-04-21 1985 83000 22274 17500 15873 NA
## 6 1986-04-20 1986 80000 25566 19261 18067 NA
## # ℹ 1 more variable: `Official charity` <chr>
Insights from the data
cat("missing value count:", sum(is.na(london_marathon)))
## missing value count: 43
missing_percentage <- colMeans(is.na(london_marathon)) * 100
plot_data <- data.frame(
variable = names(missing_percentage),
percentage = missing_percentage
)
ggplot(plot_data, aes(x = variable, y = percentage)) +
geom_bar(stat = "identity", fill = "skyblue", width = 0.5) +
labs(title = "Percentage of missing values in London Marathon", y = "Percentage") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
So we can see that there are some missing values regarding the London Marathon.
Raised records the amount raised for charity (£ millions), and Official charity records the official charity partner for that year.
So if Official charity is missing, we can assume there was no official charity that year, and Raised should be 0 (or those rows should be removed when analyzing Raised). If Official charity is present but Raised is missing, the amount was presumably not made public, and we can fill it in with the mean of the available Raised values.
For the other missing fields, the affected rows are the same. According to the official documentation on GitHub, the data for the last two years is missing, so we filter those two years out of london_marathon.
# Fill missing Raised with the overall mean when an official charity exists
london_marathon1 <- london_marathon %>% group_by(`Official charity`) %>% mutate(Raised=ifelse(is.na(Raised) & !is.na(`Official charity`), mean(london_marathon$Raised, na.rm=T), Raised))
# Where there was no official charity, set Raised to 0 and label the charity "None"
london_marathon1 <- london_marathon1 %>% mutate(Raised=ifelse(is.na(`Official charity`), 0, Raised))
london_marathon1 <- london_marathon1 %>% mutate(`Official charity`=ifelse(is.na(`Official charity`), "None", `Official charity`))
# Drop the last two (incomplete) years
london_marathon1 <- london_marathon1 %>% filter(Year < max(london_marathon1$Year) - 1)
cat("new missing value count:", sum(is.na(london_marathon1)))
## new missing value count: 0
ggplot(london_marathon1, aes(x = Year, y = Raised)) +
geom_line() +
labs(title = "Raised Over the Years in London Marathon",
x = "Year",
y = "Raised") +
theme_minimal()
From the line chart, we can see that the imputed Raised values are not informative for analyzing changes in fundraising. To analyze those changes, we should focus on the observed data from 2007 to 2019.
filter_data <- london_marathon1 %>% filter(Year >= 2007 & Year <= 2019)
filter_data
## # A tibble: 13 Ă— 8
## # Groups: Official charity [13]
## Date Year Applicants Accepted Starters Finishers Raised
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2007-04-22 2007 128000 50039 36396 35729 46.5
## 2 2008-04-13 2008 120000 48630 35037 34637 46.7
## 3 2009-04-26 2009 155000 49995 35884 35404 47.2
## 4 2010-04-25 2010 163000 51378 36956 36666 50.6
## 5 2011-04-17 2011 163926 50532 35303 34872 51.8
## 6 2012-04-22 2012 170150 50200 37227 36812 52.8
## 7 2013-04-21 2013 167449 48323 34631 34381 53
## 8 2014-04-13 2014 169682 49872 36337 35977 53.2
## 9 2015-04-26 2015 172888 51696 38020 37793 54.1
## 10 2016-04-24 2016 247069 53152 39523 39140 59.4
## 11 2017-04-23 2017 253930 53229 40048 39487 61.5
## 12 2018-04-22 2018 386050 54685 40926 40220 63.7
## 13 2019-04-28 2019 414168 56398 42906 42549 66.4
## # ℹ 1 more variable: `Official charity` <chr>
ggplot(filter_data, aes(x=Year, y=Raised, group=1))+
geom_line() +
geom_point() +
labs(y = "Raised (£ millions)") +
theme_minimal()
It can be seen that fundraising rose every year from 2007 to 2019.
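This claim can be checked directly from the Raised column of the table above (in £ millions, 2007-2019):

```r
# Raised (£ millions) for 2007-2019, copied from the table above.
raised <- c(46.5, 46.7, 47.2, 50.6, 51.8, 52.8, 53,
            53.2, 54.1, 59.4, 61.5, 63.7, 66.4)
yoy <- diff(raised)   # year-over-year change
all(yoy > 0)          # TRUE if fundraising rose every year
```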
It can be seen from this plot that the winning times in the wheelchair Men and Women categories became dramatically faster, while the winning times in the non-wheelchair Men and Women categories improved more gradually, suggesting growing competitiveness in marathon sport.
# london marathon plot
london_plot <- london_marathon %>%
filter(Year < 2020) %>%
mutate(Year = factor(Year))
ggplot(
data = london_plot,
mapping = aes(y = Year)
) +
geom_point(aes(x = Starters),
colour = "#008080"
) +
geom_point(aes(x = Finishers),
colour = "#800080"
) +
geom_segment(aes(
x = Starters,
xend = Finishers,
y = Year,
yend = Year
)) +
labs(
x = "Number of runners",
title = "Number of London Marathon Starters and Finishers"
) +
theme_minimal() +
theme(
axis.title.y = element_blank(),
plot.background = element_rect(fill = "white", colour = "white"),
panel.background = element_rect(fill = "white", colour = "white")
)
# grid.arrange() below handles the 2x2 layout; par(mfrow) has no effect on ggplot
sub_london <- london_marathon[1:39,]
p1 <- ggplot(data= sub_london, aes(x=Year, y=Applicants ))+geom_point()+geom_line()+
labs(title = "Applicants")
p2 <- ggplot(data= sub_london, aes(x=Year, y=Accepted ))+geom_point()+geom_line()+
labs(title = "Accepted")
p3 <- ggplot(data= sub_london, aes(x=Year, y=Starters ))+geom_point()+geom_line()+
labs(title = "Starters")
p4 <- ggplot(data= sub_london, aes(x=Year, y=Finishers))+geom_point()+geom_line()+
labs(title = "Finishers")
grid.arrange(p1, p2, p3, p4, nrow=2, ncol=2)
In this part, we will use linear models to fit the data and predict the number of applicants, accepted, starters, and finishers in 2024.
From the previous plots, we can see that the numbers of accepted, starters, and finishers are each roughly linear in year, so for them we use the model \[y_i=\beta_0+\beta_1x_i+\epsilon_i, \quad \epsilon_i \sim N(0,\sigma^2)\] For the number of applicants, the trend is clearly nonlinear, so we use the cubic model \[y_i=\beta_0+\beta_1x_i+\beta_2x_i^2+\beta_3x_i^3+\epsilon_i, \quad \epsilon_i \sim N(0,\sigma^2)\]
Moreover, we will check the normality of the residuals.
fit1 <- lm(Applicants ~ Year + I(Year^2)+ I(Year^3), data = sub_london)
ggplot(sub_london, aes(x = Year, y = Applicants)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", formula = y ~ poly(x, 3), se = FALSE, color = "blue") +
theme_minimal() +
labs(title = "Polynomial Regression Model", x = "Year", y = "Number of Applicants")
new <- data.frame('Year'= 2024)
predict(fit1, newdata = data.frame(new), interval = c("confidence"), level = 0.95, type="response")
## fit lwr upr
## 1 571826 498633.6 645018.4
The number of applicants is estimated to be 571826 in 2024, with a \(95\%\) confidence interval of \((498634, 645018)\).
plot(fit1, which=1)
plot(fit1, which=2)
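One caveat on these intervals: `interval = "confidence"` in `predict()` gives an interval for the mean response in 2024, not for a single new observation; the latter would use `interval = "prediction"` and is always wider. A sketch on toy data:

```r
# Sketch: confidence vs. prediction intervals from predict.lm().
set.seed(11)
d <- data.frame(x = 1:30)
d$y <- 100 + 5 * d$x + rnorm(30, sd = 10)
fit <- lm(y ~ x, data = d)
new <- data.frame(x = 35)
conf_int <- predict(fit, new, interval = "confidence", level = 0.95)
pred_int <- predict(fit, new, interval = "prediction", level = 0.95)
# The prediction interval also accounts for residual noise, so it is wider.
c(conf_width = conf_int[1, "upr"] - conf_int[1, "lwr"],
  pred_width = pred_int[1, "upr"] - pred_int[1, "lwr"])
```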
fit2 <- lm(Accepted ~ Year, data = sub_london)
ggplot(sub_london, aes(x = Year, y = Accepted)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", formula = 'y ~ x',se = FALSE, color = "blue") +
theme_minimal() +
labs(title = "SLR Model for Accepted", x = "Year", y = "Number of Accepted")
##### Prediction
new <- data.frame('Year'= 2024)
predict(fit2, newdata = data.frame(new), interval = c("confidence"), level = 0.95, type="response")
## fit lwr upr
## 1 63762.82 60926.6 66599.03
The number of accepted participants is estimated to be about 63763 in 2024, with a \(95\%\) confidence interval of \((60927, 66599)\).
plot(fit2, which=1)
plot(fit2, which=2)
fit3 <- lm(Starters ~ Year, data = sub_london)
ggplot(sub_london, aes(x = Year, y = Starters)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", formula = 'y ~ x',se = FALSE, color = "blue") +
theme_minimal() +
labs(title = "SLR Model for Starters", x = "Year", y = "Number of Starters")
##### Prediction
new <- data.frame('Year'= 2024)
predict(fit3, newdata = data.frame(new), interval = c("confidence"), level = 0.95, type="response")
## fit lwr upr
## 1 46119.77 44331.73 47907.82
The number of starters is estimated to be about 46120 in 2024, with a \(95\%\) confidence interval of \((44332, 47908)\).
plot(fit3, which=1)
plot(fit3, which=2)
fit4 <- lm(Finishers ~ Year, data = sub_london)
ggplot(sub_london, aes(x = Year, y = Finishers)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm",formula = 'y ~ x', se = FALSE, color = "blue") +
theme_minimal() +
labs(title = "SLR Model for Finishers", x = "Year", y = "Number of Finishers")
new <- data.frame('Year'= 2024)
predict(fit4, newdata = data.frame(new), interval = c("confidence"), level = 0.95, type="response")
## fit lwr upr
## 1 46056.64 44306.16 47807.13
The number of finishers is estimated to be about 46057 in 2024, with a \(95\%\) confidence interval of \((44306, 47807)\).
plot(fit4, which=1)
plot(fit4, which=2)
Let \(x\) denote the number of applicants in a year, \(y_1\) the number of accepted in a year, \(y_2\) the number of starters in a year, \(y_3\) the number of finishers in a year. We define the ratio of accepted as \(r_1=\frac{y_1}{x}\), ratio of starters as \(r_2=\frac{y_2}{y_1}\), ratio of finishers as \(r_3=\frac{y_3}{y_2}\).
london_marathon <- london_marathon |> mutate(`Ratio of Accepted`=Accepted/Applicants)|>mutate(`Ratio of Starters`=Starters/Accepted)|> mutate(`Ratio of Finishers`=Finishers/Starters)
print(london_marathon[, c("Year", "Ratio of Accepted", "Ratio of Starters","Ratio of Finishers")])
## # A tibble: 42 Ă— 4
## Year `Ratio of Accepted` `Ratio of Starters` `Ratio of Finishers`
## <dbl> <dbl> <dbl> <dbl>
## 1 1981 0.387 0.911 0.887
## 2 1982 0.201 0.905 0.925
## 3 1983 0.329 0.836 0.957
## 4 1984 0.302 0.804 0.922
## 5 1985 0.268 0.786 0.907
## 6 1986 0.320 0.753 0.938
## 7 1987 0.355 0.757 0.912
## 8 1988 0.411 0.749 0.932
## 9 1989 0.441 0.770 0.928
## 10 1990 0.478 0.760 0.944
## # ℹ 32 more rows
Accepted to Started: Over the years, there is a general trend of a decline in the percentage of accepted participants who actually start the race.
Started to Finished:
The percentage of starters who finish the race seems to be relatively stable, generally ranging from the high 80s to mid-90s.
Notable Observation (2020):
In 2020, there’s a notable observation in the “Accepted to Started” column. The percentage is 100%, indicating that all accepted participants started the race. However, the “Started to Finished” percentage is lower (79.22%), indicating that not all starters completed the race.
This discrepancy could be due to the impact of external factors such as the COVID-19 pandemic, leading to a different pattern in that particular year.
sub_london1 <- london_marathon[1:39,]
p1 <- ggplot(data= sub_london1, aes(x=Year, y=`Ratio of Accepted` ))+geom_point()+geom_line()+
labs(title = "Ratio of Accepted")
p2 <- ggplot(data= sub_london1, aes(x=Year, y=`Ratio of Starters` ))+geom_point()+geom_line()+
labs(title = "Ratio of Starters")
p3 <- ggplot(data= sub_london1, aes(x=Year, y=`Ratio of Finishers` ))+geom_point()+geom_line()+
labs(title = "Ratio of Finishers")
grid.arrange(p1, p2, p3, ncol=1)
In this part, we focus on four time series: the number of applicants each year, the ratio of accepted each year, the ratio of starters each year, and the ratio of finishers each year. We will use ARIMA (AutoRegressive Integrated Moving Average) models to analyze and forecast these series from the London Marathon dataset. For each series, we will draw a 10-year forecast plot, check the autocorrelation function of the residuals, check the normality of the residuals, and run a Box–Ljung test (a statistical test of whether any of a group of autocorrelations of a time series differ from zero).
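Before applying it to the marathon series, a quick illustration of the Box–Ljung test on simulated white noise, where by construction no autocorrelation exists:

```r
# Sketch: Ljung-Box test on white noise; a small p-value would indicate
# autocorrelation, which true white noise produces only about 5% of the time.
set.seed(2024)
wn <- rnorm(200)
bt <- Box.test(wn, lag = 20, type = "Ljung-Box")
bt$p.value
```

When the test is applied to residuals of a fitted ARIMA model, the `fitdf` argument of `Box.test()` can be set to the number of estimated parameters to adjust the degrees of freedom of the null distribution.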
# a function to plot the forecast errors
plotForecastErrors <- function(forecasterrors)
{
# make a histogram of the forecast errors:
mybinsize <- IQR(forecasterrors)/4
mysd <- sd(forecasterrors)
mymin <- min(forecasterrors) - mysd*5
mymax <- max(forecasterrors) + mysd*3
# generate normally distributed data with mean 0 and standard deviation mysd
mynorm <- rnorm(10000, mean=0, sd=mysd)
mymin2 <- min(mynorm)
mymax2 <- max(mynorm)
if (mymin2 < mymin) { mymin <- mymin2 }
if (mymax2 > mymax) { mymax <- mymax2 }
# make a red histogram of the forecast errors, with the normally distributed data overlaid:
mybins <- seq(mymin, mymax, mybinsize)
hist(forecasterrors, col="red", freq=FALSE, breaks=mybins)
# freq=FALSE ensures the area under the histogram = 1
# generate normally distributed data with mean 0 and standard deviation mysd
myhist <- hist(mynorm, plot=FALSE, breaks=mybins)
# plot the normal curve as a blue line on top of the histogram of forecast errors:
points(myhist$mids, myhist$density, type="l", col="blue", lwd=2)
}
app <- london_marathon$Applicants[1:40]
appts <- ts(app, frequency=1, start=1981)
appts <- na.omit(appts)
# fit an ARIMA model selected automatically by auto.arima()
apptsarima <- auto.arima(appts)
# forecast the number of applicants over the next 10 years
fc <- forecast(apptsarima, h=10)
plot(fc, xlab = "Year", ylab= "Number of Applicants", main = "Forecasts of Number of Applicants")
The forecast plot suggests that the number of applicants to the London Marathon will increase by roughly 6000 per year over the next 10 years.
# check the autocorrelation function at lags 1 to 20
acf(fc$residuals, lag.max=20, main="ACF Check for Residuals")
# Box-Ljung test
Box.test(fc$residuals, lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: fc$residuals
## X-squared = 5.7183, df = 20, p-value = 0.9992
# histogram of the forecast errors
plotForecastErrors(fc$residuals)
roa <- london_marathon$`Ratio of Accepted`
roats <- ts(roa, frequency = 1, start = 1981)
roats <- na.omit(roats)
# fit an ARIMA model automatically with auto.arima()
roatsarima <- auto.arima(roats)
# forecast the ratio of accepted over the next 10 years
fc2 <- forecast(roatsarima, h = 10)
plot(fc2, xlab = "Year", ylab = "Ratio of Accepted", main = "Forecasts of Ratio of Accepted")
The forecast plot shows that the acceptance ratio for the London Marathon will remain roughly constant over the next 10 years.
# check the autocorrelation function at lags 1 to 20
acf(fc2$residuals, lag.max=20, main="ACF Check for Residuals")
# Box-Ljung test
Box.test(fc2$residuals, lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: fc2$residuals
## X-squared = 10.473, df = 20, p-value = 0.9588
# histogram of the forecast errors
plotForecastErrors(fc2$residuals)
ros <- london_marathon$`Ratio of Starters`
rosts <- ts(ros, frequency = 1, start = 1981)
rosts <- na.omit(rosts)
rostsarima <- auto.arima(rosts)
fc3 <- forecast(rostsarima, h = 10)
plot(fc3, xlab = "Year", ylab = "Ratio of Starters", main = "Forecasts of Ratio of Starters")
The forecast plot shows that the ratio of starters for the London Marathon will decrease over the next 10 years.
acf(fc3$residuals, lag.max=20, main="ACF Check for Residuals")
Box.test(fc3$residuals, lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: fc3$residuals
## X-squared = 8.5628, df = 20, p-value = 0.9874
plotForecastErrors(fc3$residuals)
rof <- london_marathon$`Ratio of Finishers`
rofts <- ts(rof, frequency = 1, start = 1981)
rofts <- na.omit(rofts)
roftsarima <- auto.arima(rofts)
fc4 <- forecast(roftsarima, h = 10)
plot(fc4, xlab = "Year", ylab = "Ratio of Finishers", main = "Forecasts of Ratio of Finishers")
The forecast plot shows that the ratio of finishers for the London Marathon will remain roughly constant over the next 10 years.
acf(fc4$residuals, lag.max=20, main="ACF Check for Residuals")
Box.test(fc4$residuals, lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: fc4$residuals
## X-squared = 2.745, df = 20, p-value = 1
plotForecastErrors(fc4$residuals)
Since the correlograms above show that none of the sample autocorrelations at lags 1-20 exceed the significance bounds, and the p-values of the Ljung-Box tests are all close to 1, there is very little evidence of non-zero autocorrelation in the forecast errors at lags 1-20.
The histograms of the forecast errors show that they are roughly normally distributed with means close to zero. It is therefore plausible that the forecast errors are normally distributed with mean zero and constant variance.
Since successive forecast errors do not appear to be correlated, and they seem to be normally distributed with mean zero and constant variance, the fitted ARIMA models do appear to provide adequate predictive models for the dataset.
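Beyond eyeballing the histograms, normality of the residuals can also be checked formally, e.g. with a Shapiro-Wilk test. A minimal base-R sketch on simulated residuals (with the real models one would pass the residuals of each fit instead):

```r
# Shapiro-Wilk normality check on simulated residuals
set.seed(123)
resids <- rnorm(40, mean = 0, sd = 1500)  # stand-in for a model's residuals

# A large p-value indicates no evidence against normality
shapiro.test(resids)
# It is also worth checking that the mean is near zero relative to its spread
t.test(resids, mu = 0)
```

This complements the visual check with plotForecastErrors() above rather than replacing it.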
Although there is some overlap between team members' work, we came up with the questions independently and carried out the analyses independently. The report is an integration of our work. The tasks were divided as follows:
data wrangling part
winner dataset
- plot of time versus year
- boxplots of time based on category
- analysis of differences in athletes' performance based on nationality
- analysis of differences between the performance of men and women
- summarising the countries that have the most winners
london_marathon dataset
- plots of variables (applicants, accepted, starters, finishers) versus year
- creating and plotting new variables
- linear models for the dataset
- time series analysis and forecasting
integration of the team's work and composition of report V1.0